Skip to content

Conversation

@MasterPtato
Copy link
Contributor

@MasterPtato MasterPtato commented Jan 12, 2026

Fixes RVT-5330

Copy link
Contributor Author

MasterPtato commented Jan 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@linear
Copy link

linear bot commented Jan 12, 2026

@claude
Copy link

claude bot commented Jan 12, 2026

PR Review: Fix restart policy actors sleeping after pending timeout

Summary

This PR addresses RVT-5330 by ensuring that Rivet Actors with CrashPolicy::Restart will transition to sleep state if they remain pending (unable to allocate) for more than 60 seconds. This prevents actors from getting stuck in a pending state indefinitely.

Code Quality ✅

Strengths:

  • Clean implementation following existing patterns in the codebase
  • Good constant naming (RESTART_PENDING_TIMEOUT_MS) with clear documentation
  • Proper use of the existing AllocationOverride enum pattern
  • Follows the repository's structured logging conventions (lowercase messages, structured fields)

Minor observations:

  • The new variant PendingTimeout is well-integrated into the existing enum structure
  • Comment formatting follows the CLAUDE.md guidelines (complete sentences, no dashes)

Functionality ✅

The logic is sound:

  1. New constant (mod.rs:18): Defines a 60-second timeout for restart policy actors
  2. Updated handle_stopped (mod.rs:1057-1074): When a CrashPolicy::Restart actor crashes, it now reschedules with AllocationOverride::PendingTimeout
  3. Extended spawn_actor (runtime.rs:766-777): The timeout handling now includes the new PendingTimeout variant alongside DontSleep
  4. Correct state transitions: The PR properly handles the SpawnActorOutput::Sleep case (line 1069), which was previously commented as impossible

Potential Issues 🔍

1. Comment removal justification (mod.rs:1057-1070)

The removed comment stated:

// NOTE: Its not possible for `SpawnActorOutput::Sleep` to be returned here, the crash
// policy is `Restart`.

This comment is now obsolete since SpawnActorOutput::Sleep can be returned when the pending timeout expires. However, the PR doesn't add any logging when this transition occurs. Consider:

runtime::SpawnActorOutput::Sleep => {
    tracing::debug!(
        actor_id=?input.actor_id,
        ?RESTART_PENDING_TIMEOUT_MS,
        "restart policy actor sleeping after pending timeout"
    );
}

2. Missing state management

When SpawnActorOutput::Sleep is returned (line 1069), the code doesn't set state.sleeping = true or call SetSleepingInput activity. Looking at other places in the code (e.g., lines 1078-1086), sleeping actors need proper state updates. This could lead to inconsistent state.

Suggested fix:

runtime::SpawnActorOutput::Sleep => {
    state.sleeping = true;

    ctx.activity(runtime::SetSleepingInput {
        actor_id: input.actor_id,
    })
    .await?;
}

3. No failure reason update

When an actor times out pending and goes to sleep, it might be valuable to set or preserve a NoCapacity failure reason so users understand why their restart-policy actor is sleeping. Currently, if the actor had a crash-related failure reason, it would remain, which could be confusing.

Performance Considerations ✅

  • No performance concerns
  • The timeout mechanism uses existing infrastructure (listen_with_timeout)
  • 60-second timeout is reasonable for preventing indefinite pending states

Security Concerns ✅

No security issues identified.

Test Coverage ⚠️

Missing test coverage:

  • No tests found for the restart policy pending timeout behavior
  • Should add tests verifying:
    • Restart policy actors transition to sleep after 60s pending
    • State is correctly updated when timeout occurs
    • Actors can be woken from this sleep state
    • Behavior differs from CrashPolicy::Sleep actors

Recommendations

Must fix:

  1. Add state management when SpawnActorOutput::Sleep is returned (line 1069) - missing state.sleeping = true and SetSleepingInput activity

Should consider:
2. Add debug logging when restart policy actors transition to sleep due to timeout
3. Consider setting/clearing failure reasons appropriately when timeout occurs
4. Add integration tests for this new behavior path

Nice to have:
5. Document the timeout behavior in user-facing documentation if not already present

Verdict

The core logic is correct and well-implemented, but there's a critical state management issue at line 1069 where the sleeping state isn't properly set. This needs to be fixed before merging.

Overall assessment: Needs minor changes ⚠️

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 12, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3837

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3837

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3837

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3837

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3837

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3837

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3837

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3837

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3837

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3837

commit: e96badf

This was referenced Jan 12, 2026
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from b613527 to b4daf93 Compare January 13, 2026 00:27
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch 2 times, most recently from 363b7a5 to 3f06e1b Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from b4daf93 to 0b9ca2f Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 3f06e1b to 8cd75d7 Compare January 13, 2026 01:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 0b9ca2f to 1a08758 Compare January 13, 2026 01:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 8cd75d7 to 75ce8e5 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 1a08758 to 5b1c724 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 5b1c724 to 8152462 Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch 2 times, most recently from 50ba096 to b9a9c50 Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 8152462 to 82ed36d Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 82ed36d to f5a4f68 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from b9a9c50 to 9a4ab5a Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from f5a4f68 to 9e4ae6f Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 9a4ab5a to 33e8961 Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 33e8961 to 633544a Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 9e4ae6f to 595d7b8 Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from 633544a to e073155 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch 2 times, most recently from 250667d to 980dd0a Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from e073155 to e96badf Compare January 14, 2026 23:07
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 980dd0a to 97ace26 Compare January 14, 2026 23:39
@MasterPtato MasterPtato force-pushed the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch from e96badf to 448ef1f Compare January 14, 2026 23:39
@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

graphite-app bot pushed a commit that referenced this pull request Jan 14, 2026
@graphite-app graphite-app bot closed this Jan 14, 2026
@graphite-app graphite-app bot deleted the 01-12-fix_pb_have_restart_policy_actors_sleep_after_pending_for_too_long branch January 14, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants